Abstract

Recently, technologies for detecting DNA sequence variation
and profiling gene expression have been widely used in
genome biology and genetics. Genome studies have advanced
particularly with the introduction of the next-generation
DNA sequencing (NGS) systems Roche/454 and Illumina/Solexa,
along with bioinformatics technologies for whole-genome
de novo assembly, expression profiling, DNA variation
discovery, and genotyping. Both massive whole-genome shotgun
paired-end sequencing and mate paired-end sequencing data
are important for constructing a de novo assembly of a novel
genome. For an effective de novo assembly, it is necessary
to have DNA sequence information from multiplatform NGS with
at least 2x and 30x depth of genome coverage using Roche/454
and Illumina/Solexa, respectively. Massive short-read data
from the Illumina/Solexa system are enough to discover DNA
variation, which reduces the cost of DNA sequencing.
Whole-genome expression
profile data are useful for approaching genome systems
biology by quantifying expressed RNAs from a whole-genome
transcriptome, depending on the tissue samples. Hybrid mRNA
sequences from Roche/454 and Illumina/Solexa are more
powerful for finding novel genes through de novo assembly in
any species with a sequenced whole genome. Coverage of 20x
and 50x of the estimated transcriptome using Roche/454 and
Illumina/Solexa, respectively, is effective for creating
novel expressed reference sequences. However, an average of
only 30x transcriptome coverage with short Illumina/Solexa
reads is enough to quantify expression against reference
expressed sequence tag sequences.

Keywords: de novo assembly, expression profiling, mul- 
tiplatform, NGS, resequencing, whole genome 

Introduction 

Scientists have tried to understand biology through DNA
sequence information ever since DNA was verified as the unit
of genetic heredity. Scientists have also hoped to
dramatically reduce the cost of reading genomic DNA and to
obtain high-throughput DNA sequence information. After
automated DNA sequencers were developed in the 1980s using
fluorescent dyes of different colors, lasers, and computer
technology, the Human Genome Project (HGP) began in 1990,
and the completed human genome sequence was released in
2003, while further analysis is still being published. A
total of about 3 billion dollars was invested in the
project. In 1991, the National Human Genome Research
Institute (NHGRI) funds were geared toward lowering the cost
of DNA sequencing. Some of the technologies funded improved
DNA sequencing. To date, Applied Biosystems, Roche/454, and
Illumina/Solexa have successfully developed their
technologies and applied DNA sequencing worldwide during the
recent 6 years. Most of the newest technologies currently
in use generate sequences from 36 to 1,000 base pairs, which
require special software for different applications,
including whole-genome sequencing, transcriptome analysis,
and regulatory gene analysis. In particular, the development
of in silico methods using bioinformatics software for
next-generation sequence assembly would be valuable for many
genome projects and further applications in genome biology.
Many biologists and geneticists working with massive DNA
and RNA sequence data have applied sequencing to various
research fields: medicine, such as improved diagnosis of
disease, gene therapy, and control systems for drugs,
including pharmacogenetic "custom drugs"; evolution;
bioenergy and environmental applications, such as creative
energy sources (biofuels) and cleanup of toxic wastes,
including efficient environmental sources against carbon;
and agricultural projects, including livestock breeding for
healthier, more productive, disease-resistant farm animals
and breeding of disease-, insect-, and drought-resistant
crops. In this paper, we report several approaches to genome
sequencing and expression profiling in genome biology.

Current Next-generation Sequencing (NGS) 
Technology 

Roche/454 pyrosequencing technology 

The first highly parallel sequencing system was developed
with an emulsion PCR method for DNA amplification and an
instrument for sequencing by synthesis using a
pyrosequencing protocol optimized for the individual wells
of a PicoTiterPlate (PTP) [1]. The DNA sequencing protocol,
including sample preparation, is supplied to users by the
developer, 454 Life Sciences (now part of Roche Diagnostics,
Mannheim, Germany). Even when following the manufacturer's
protocol, to obtain high-quality DNA sequence with the
maximal possible total sequence length, one should make a
library with both adapters on the sheared DNA fragments and
mix library DNA and beads at the optimal ratio for emulsion
PCR. The upgraded system now has a throughput averaging
700 Mbp in total, with an average of 600 bp/read, in one PTP
run. The sequencer is available
for single reads and paired-end read sequencing for the 
application of eukaryotic and prokaryotic whole-genome 
sequencing [2, 3], metagenomics and microbial diversity 
[4, 5], and genetic variation detection for comparative 
genomics [6]. 
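
As a rough illustration of what this throughput means when
planning a project, the sketch below estimates the average
depth obtained from a given yield and the number of runs
needed to reach a target depth. Only the 700 Mbp per run
figure comes from the text; the function names, the 5-Mb
genome, and the 20x target are assumptions chosen for
illustration.

```python
import math


def coverage_depth(total_bases: float, genome_size: float) -> float:
    """Average depth = total sequenced bases / estimated genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def runs_needed(target_depth: float, genome_size: float,
                bases_per_run: float) -> int:
    """Smallest whole number of runs whose combined yield reaches target_depth."""
    if bases_per_run <= 0:
        raise ValueError("bases_per_run must be positive")
    return math.ceil(target_depth * genome_size / bases_per_run)


if __name__ == "__main__":
    genome = 5_000_000          # hypothetical 5-Mb bacterial genome
    per_run = 700_000_000       # average yield of one PTP run (from the text)
    print(coverage_depth(per_run, genome))      # 140.0
    print(runs_needed(20, genome, per_run))     # 1
```

The same arithmetic applies to any platform once its yield per
run is substituted for `bases_per_run`.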

Illumina/Solexa technology 

Numerous cost-effective technologies have been developed for
human genome resequencing, in which reads are aligned to the
reference sequence [7, 8]. The first successful technology
to make massive DNA sequencing available for resequencing
was developed by Solexa (now part of Illumina, San Diego,
CA, USA). The principle of the system is base-by-base
sequencing by synthesis: the sheared template DNA is
amplified on a flat surface slide (flow cell), and one base
is detected on each template per cycle with four
base-specific, fluorescently labeled signals. Signals for
all four fluorescent channels are collected and plotted at
each position, enabling per-base quality scores to be
derived using the four-color information if desired [8].
Now, a maximum output of 300 Gb is available with 101-bp
paired-end reads per fragment in a single flow cell run of
the upgraded HiSeq system.
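
Per-base quality scores are conventionally reported on the
Phred scale, Q = -10 log10(P), where P is the estimated
probability that the base call is wrong. The sketch below
shows that conversion and how an ASCII-encoded quality string
can be decoded. It is not the base-calling procedure of the
instrument itself; the function names are hypothetical, and
the ASCII offset is an assumption that must match the data
(33 for Sanger-style FASTQ, 64 for some older Illumina
pipelines).

```python
import math
from typing import List


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to the base-call error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert a base-call error probability (0 < p <= 1) to a Phred score."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


def decode_quality_string(qual: str, offset: int = 33) -> List[int]:
    """Decode an ASCII quality string; the offset is an assumption (see above)."""
    return [ord(ch) - offset for ch in qual]


if __name__ == "__main__":
    print(phred_to_error_prob(30))          # about 0.001, i.e. 1 error in 1,000
    print(decode_quality_string("I5"))      # [40, 20] with offset 33
```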

Application of NGS to Genome Research 

Novel whole genome de novo assembly 

More than 11,000 sequencing projects, including targeted
projects, were reported in the Genomes OnLine Database
(GOLD, http://www.genomesonline.org) in early 2012. More
than 3,000 genome projects have now been completed across
diverse species, and more than 90% of the completed projects
were bacterial genome sequencing. Most bacterial genome
sequencing was performed with 454 pyrosequencing because it
offered the longest reads available, which are useful for
de novo assembly of novel genomes. The officially
recommended depth for 454 pyrosequencing of a whole
bacterial genome for de novo assembly is at least 15-20x the
estimated genome size [9-13]. However, Li et al. [3]
reported that 6-10x sequencing in qualified runs with 500-bp
reads would be enough for de novo assemblies from 1,480
prokaryote genomes, giving >98% genome coverage, <100
contigs, and N50 size >100 kb.
Recently, 101-bp paired-end read data from Illumina/Solexa
systems have been used for de novo assembly and resequencing
of prokaryote whole genomes. For example, a Bacillus
subtilis subspecies genome sequence was generated from
Illumina/Solexa short reads and assembled with the Velvet
program [14]. In this case, the genome assembly was
completed by using the reference genome to order the
numerous contigs derived from de novo assembly. Although
numerous contigs have been assembled from Illumina/Solexa
data for eukaryotic genomes, only a few draft assemblies
have been reported. One is the giant panda genome [15],
for which the assembled contigs (2.25 Gb) cover
approximately 94% of the expected whole genome. Another
example is the woodland strawberry genome (240 Mb) [16],
which was sequenced to 39x depth, assembled de novo, and
anchored to the linkage map of seven pseudochromosomes.
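
Assembly quality in the examples above is summarized by the
number of contigs and the N50 size. As a minimal sketch, N50
can be computed as the length of the contig at which the
running total of contig lengths, sorted from longest to
shortest, first reaches half of the total assembly length.
The function name and the toy contig lengths are assumptions
for illustration, not data from the cited assemblies.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of an assembly given its contig lengths in bp."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Stop at the first contig that brings the running sum to >= 50%.
        if running * 2 >= total:
            return length
    return lengths[-1]  # not reached for positive lengths


if __name__ == "__main__":
    toy_contigs = [100, 80, 60, 40, 20]   # hypothetical lengths, total 300
    print(n50(toy_contigs))               # 80
```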

Fig. 1. Stringency of whole-genome DNA shotgun sequencing for a novel genome. Whole-genome shotgun sequencing (left of
the figure) for contig construction: sequencing of single-end or paired-end fragments of a whole-genome DNA shotgun library,
which is made with fragments in the average range of 200-800 bp for Roche/454 or Illumina/Solexa systems. In general, a
total DNA sequence amount of 15-20x or 60x coverage depth of the genome is produced using Roche/454 or Illumina/Solexa,
respectively. Whole-genome shotgun mate paired-end sequencing for scaffold construction (right of the figure): sequencing
of the mate paired-end fragments of a whole-genome DNA shotgun library, which is made with fragments in the average range
of 2-40 kb for next-generation DNA sequencers. A sequencing amount of more than 20x coverage depth of the genome is
effective for scaffold construction.

The genome sequence can be associated with predicted genes
using transcriptome sequence data. An ideal method for
cost-effective novel genome sequencing using NGS is de novo
assembly with diverse shotgun fragment end sequencing data
from multiplatform systems (Fig. 1). The first step in
sequencing a novel genome is sequencing the genomic DNA for
contig and scaffold construction, using randomly sheared
shotgun single-end or paired-end reads from Roche/454 or
Illumina/Solexa, together with information on how to
assemble the NGS data using various assembly software.
Recently, a catfish genome was sequenced with
multiplatform Roche/454 and Illumina/Solexa technology and
assembled with an effective combination of low coverage
depth of 18x Roche/454 and 70x Illumina/Solexa data using
three assembly programs: Newbler for the 454 reads, the
Velvet assembler for the Illumina reads, and the MIRA
assembler for the final assembly of contigs and singletons
derived from the initial assemblies. This resulted in 193
contigs with an N50 value of 13,123 bp [2]. In another
multiplatform assembly, of the 40-Mb eukaryotic genome of
the fungus Sordaria macrospora, a combination of 85-fold
coverage by Illumina/Solexa and 10-fold coverage by
Roche/454 sequencing was assembled into a 40-Mb draft
version (N50 of 117 kb) with the Velvet assembler as a
reference for a model organism for fungal morphogenesis
[17]. In the recently reported effective assembly methods,
combinations of multiplatform sequences have yielded
successful novel genome assemblies using various assembly
pipelines. Comparing these assembly strategies, we suggest
an effective integrated pipeline in which the data are
filtered to remove low-quality and short reads, initial
assemblies are made using various software and their contigs
compared, hybrid contigs are built using the MIRA assembler,
and finally contigs are ordered using SSPACE software
(http://www.baseclear.com/dna-sequencing/data-analysis/)
[18] for scaffold construction through de novo assembly of
novel genome sequencing (Fig. 2). Based on the comparison of
several approaches to de novo assembly, we suggest using DNA
sequences from multiplatform NGS with at least 2x and 30x
depth of genome coverage from Roche/454 and Illumina/Solexa,
respectively, and performing hybrid assembly for
cost-effective novel genome sequencing.
Fig. 2. Integrated pipeline for de novo assembly of novel
genome sequencing. The scheme filters the data to remove
low-quality and short reads, makes initial assemblies using
various software and compares their contigs, builds hybrid
contigs using the MIRA assembler, and orders contigs using
SSPACE software for scaffold construction.
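
The first step of such a pipeline, removing low-quality and
short reads before assembly, can be sketched as follows. This
is only an illustration under stated assumptions: reads are
represented as simple tuples rather than parsed from a file,
and the function name and the thresholds (minimum length 50
bp, minimum mean Phred quality 20) are hypothetical values,
not settings taken from the pipeline in Fig. 2.

```python
from typing import Iterable, Iterator, List, Tuple

# (read name, base sequence, per-base Phred quality scores)
Read = Tuple[str, str, List[int]]


def filter_reads(reads: Iterable[Read],
                 min_length: int = 50,
                 min_mean_quality: float = 20.0) -> Iterator[Read]:
    """Yield only reads that pass the length and mean-quality thresholds."""
    for name, bases, quals in reads:
        if len(bases) != len(quals):
            raise ValueError(f"read {name}: sequence/quality length mismatch")
        if not bases or len(bases) < min_length:
            continue  # drop empty and short reads
        if sum(quals) / len(quals) < min_mean_quality:
            continue  # drop low-quality reads
        yield name, bases, quals


if __name__ == "__main__":
    toy_reads = [
        ("r1", "A" * 60, [30] * 60),   # kept
        ("r2", "A" * 10, [30] * 10),   # too short
        ("r3", "A" * 60, [10] * 60),   # mean quality too low
    ]
    print([name for name, _, _ in filter_reads(toy_reads)])   # ['r1']
```

The surviving reads would then be passed to the platform-specific
assemblers, whose contigs feed the hybrid assembly step.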



SNP discovery and genotyping with resequencing 

Resequencing of genomic regions or target genes of interest
for a phenotype is the first step in detecting DNA
variations associated with gene regulation. The discovery of
single-nucleotide polymorphisms (SNPs) and
insertions/deletions (indels) with high-throughput data is
useful for studying genetic variation, comparative genomics,
linkage maps, and genomic selection for breeding value. Many
geneticists studying microbial, plant, animal, and human
genomes have effectively used NGS whole-genome resequencing
data in various research fields, such as bacterial evolution
[19], genome-wide analysis of mutagenesis in Escherichia
coli strains [20], comparative genomics of the swine
pathogen Streptococcus suis [21], effects of genomic
variation on phenotype and gene regulation in mouse [22],
plant evolution [23], and comparison of genetic variations
in targeted enrichment [24]. Resequencing projects have used
short Illumina/Solexa reads aligned to the reference
sequence to discover DNA variations between the sequences of
related species. Because SNPs occur rarely in most species,
it is important to identify high-accuracy data for
discovering DNA variations according to coverage depth,
using MAQ (http://maq.sourceforge.net/maq-man.shtml) [25]
and CLC software (http://www.clcbio.com). Published
protocols for coverage depth require at least 30x coverage
of the reference genome to discover SNPs and indels in a
heterogeneous genome, while about 10x depth of coverage is
enough for DNA variation study of homogeneous
genomes. Of course, a high depth of coverage provides
high-quality data for SNP detection by reference mapping
(Fig. 3). Short reads of 35 bp or 100 bp are long enough to
map to the reference sequence using the MAQ and CLC
software, even in genomes that include short repeated block
regions. However, geneticists still require long-read
sequencing data to distinguish repeated block regions such
as paralogous regions derived from gene duplication. MAQ
software provides a consensus sequence of the sequenced
genotype from short reads, with the raw reads aligned to the
reference sequence. CLC software checks accuracy by counting
the reads that carry a DNA variation at each position.
Recently, a novel application of pattern recognition for
accurate discovery of DNA variations in a complex genomic
region was reported using high-throughput data from a
Caucasian population [26]. The authors used three
independent datasets from Sanger sequencing and Affymetrix
and Illumina microarrays to validate SNPs and indels in a
clinical target region, FKBP5. Therefore, multiplatform
systems are necessary to validate DNA variations in specific
complex regions of the genome.
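
The idea of checking a candidate variant by counting the
reads that support it at a position, subject to a minimum
coverage depth, can be sketched as below. This is a
simplified illustration, not the algorithm implemented in
MAQ or CLC software: the function name is hypothetical, the
default minimum depth of 10 echoes the 10x figure above, and
the minimum alternate-allele fraction of 0.2 is an arbitrary
assumption.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

VALID_BASES = {"A", "C", "G", "T"}


def call_variant(ref_base: str,
                 observed_bases: Iterable[str],
                 min_depth: int = 10,
                 min_alt_fraction: float = 0.2
                 ) -> Optional[Tuple[str, int, float]]:
    """Return (alt_base, depth, alt_fraction) for one position, or None.

    observed_bases holds the base each aligned read shows at the position.
    """
    bases = [b.upper() for b in observed_bases if b.upper() in VALID_BASES]
    depth = len(bases)
    if depth < min_depth:
        return None  # too little coverage to make a call
    counts = Counter(b for b in bases if b != ref_base.upper())
    if not counts:
        return None  # every read agrees with the reference
    alt_base, alt_count = counts.most_common(1)[0]
    alt_fraction = alt_count / depth
    if alt_fraction < min_alt_fraction:
        return None  # likely sequencing error rather than a variant
    return alt_base, depth, alt_fraction


if __name__ == "__main__":
    # Toy pileup: 15 reads show the reference A, 15 show G.
    print(call_variant("A", "A" * 15 + "G" * 15))   # ('G', 30, 0.5)
```

Raising `min_depth` to 30 would mirror the stricter depth quoted
above for heterogeneous genomes.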

Expression profiling 

Gene expression profiling measures the regulation of the
transcriptome of a whole genome in the field of molecular
biology. A conventional method for measuring the relative
activity of target genes is DNA microarray technology, which
estimates expressed genes from the hybridization signals of
target genes (cDNA from mRNA) on synthesized
oligonucleotides [21]. The technology is still used for
functional genomics in a wide area, including medicine,
clinical practice, plant science, and agricultural
biotechnology [28-30]. In addition, microarray technology is
also used in comparative studies of proteomics and
expression, measuring the level of extracellular matrix
protein [30]. Since NGS technology was developed in 2005,
the transcriptomes of novel whole genomes have been
identified by massively parallel mRNA sequencing using
Roche/454 and Illumina/Solexa [31-36]. The Roche/454 system
is more useful for novel gene discovery in the genomes of
novel species because of its long reads [37, 38]. In
contrast, Illumina/Solexa is used to profile the expression
of known genes by mapping short reads to known reference
genes [39, 40]. In that case, rarely expressed genes and
novel genes can be identified with high-throughput expressed
sequence tag sequences using Illumina/Solexa. It is also
useful for finding significant tissue-specific expression
biases by comparing transcript data [22]. Now, hybrid mRNA
sequence data from Roche/454 and Illumina/Solexa are more
powerful for finding novel genes through de novo assembly in
any species with a sequenced whole genome.

Fig. 3. View of single nucleotide polymorphism (SNP)
discovery through mapping short reads from Illumina/Solexa
to a reference sequence in MAQ software (A) and CLC software
(B). (A) Short reads of 35 bp from the soybean genome are
shown completely mapped to the soybean reference sequence.
The MAQ software provides a consensus sequence of the
sequenced genotype from short reads, with the raw reads
aligned to the reference sequence. (B) CLC software is
useful for counting reads with DNA variations at each
position.

The hybrid sequence data of 20x and 50x coverage of the
estimated transcriptome from Roche/454 and Illumina/Solexa,
respectively, are effective for creating novel expressed
reference sequences, while short-read Illumina/Solexa data
are cost-efficient for expression quantification when
comparing exposed samples and natural phenotype samples by
mapping to the reference genes (Fig. 4). An average of only
30x transcriptome coverage with short Illumina/Solexa reads
is enough to quantify expression against reference expressed
sequence tag sequences. The expression information obtained
can differ depending on the software used, such as CAP3,
MIRA, Newbler, SeqMan, and CLC. Therefore, results should be
compared across various program options to define robust
expression profiling [41]. To date, ChIP-on-chip has been a
powerful tool for understanding the regulation of gene
transcription: two-channel microarray technology combined
with chromatin immunoprecipitation can be used for
genome-wide mapping of the binding sites of DNA-interacting
proteins [29]. In any NGS application, transcriptome
expression information would be more useful than complete
genome information for biologists with the lowest sequencing
budget who want to better understand the gene regulation of
related genetic phenotypes with in silico methods. Among in
silico methods, discovery of conserved and novel miRNAs is
possible from massive miRNAnome data in any species. In
particular, the target
genes of the discovered miRNAs could provide robust
information for genome biology studies. A transcriptome is
smaller than a genome, so its assembly should be more
computationally tractable, but it is often harder because
individual contigs can have highly variable read coverage.
Comparing single assemblers, Newbler 2.5 performed the best
on our trial dataset, but other assemblers were closely
comparable. Combining different optimal assemblies from
different programs, however, gives a more credible final
product, and this strategy is recommended [41].

Fig. 4. A scheme of transcriptome expression analysis through massively parallel signature sequencing (MPSS) technology
and bioinformatics: the identification of expressed genes through hybrid de novo assembly with Roche/454 and
Illumina/Solexa data (left) and expression level profiling through mapping the Illumina/Solexa sequence to the expressed
sequence tag reference.
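
Expression quantification by mapping short reads to reference
transcripts comes down to counting the reads assigned to each
transcript and normalizing the counts. The text does not name
a normalization, so the sketch below uses one common choice,
reads per kilobase of transcript per million mapped reads
(RPKM), as an assumption; the function names, gene names, and
counts are hypothetical.

```python
from typing import Dict


def rpkm(read_count: int, transcript_length_bp: int,
         total_mapped_reads: int) -> float:
    """Reads per kilobase of transcript per million mapped reads."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


def expression_profile(counts: Dict[str, int],
                       lengths_bp: Dict[str, int]) -> Dict[str, float]:
    """Normalize per-transcript read counts from one sample to RPKM."""
    total = sum(counts.values())
    return {gene: rpkm(n, lengths_bp[gene], total)
            for gene, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical counts of mapped reads and transcript lengths.
    counts = {"geneA": 500, "geneB": 500}
    lengths = {"geneA": 1000, "geneB": 2000}
    profile = expression_profile(counts, lengths)
    # Equal counts on a transcript twice as long give half the RPKM.
    print(profile["geneA"] / profile["geneB"])   # 2.0
```

Profiles computed this way for two samples can then be compared
transcript by transcript to look for expression differences.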

Conclusion 

NGS technology provides a cost-effective way of sequencing
for novel whole-genome sequencing, resequencing, and
expression profiling. Roche/454 pyrosequencing is
recommended for de novo assembly of novel whole prokaryote
genomes, while hybrid assembly with Illumina/Solexa
sequences would be optimal for whole eukaryote genomes and
transcriptome studies of nonmodel organisms. Illumina/Solexa
sequencing is also useful for detecting DNA variation by
mapping short resequencing reads to the reference genome and
for profiling expressed genes in model organisms.
Furthermore, high-throughput NGS enables in silico studies
in various research fields of molecular genetics, including
population diversity and comparative genomics, in a short
time.